Cosmos3 ModularPipeline#14110
yiyixuxu
left a comment
thanks for working on this!
I did an initial review, focusing mainly on the encoder/decoder blocks for now. In modular, these blocks are meant to be run standalone (e.g. a user encodes an image once, keeps the latents, and reuses them across generations), or combined into a pipeline you can run end-to-end like a standard pipeline.
i will do another pass soon, let me know if you have any questions!
| @property | ||
| def expected_components(self) -> list[ComponentSpec]: | ||
| return [ | ||
| ComponentSpec("transformer", Cosmos3OmniTransformer), |
| ComponentSpec("transformer", Cosmos3OmniTransformer), |
| ComponentSpec("vae", AutoencoderKLWan), | ||
| ComponentSpec("sound_tokenizer", Cosmos3AVAEAudioTokenizer), |
| ComponentSpec("vae", AutoencoderKLWan), | |
| ComponentSpec("sound_tokenizer", Cosmos3AVAEAudioTokenizer), |
| @property | ||
| def description(self) -> str: | ||
| return "Validates inputs, tokenizes prompts, and packs text conditioning." |
I think this step can just run safety_checker + tokenize. We want the text encoder block to be meaningful to run standalone, as well as combined with other blocks.
i.e., the user can run it once, keep the text segments, and reuse them across many generations with different resolutions, conditional inputs, seeds, etc.
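To make the "run once, reuse" idea concrete, here is a minimal plain-Python sketch. It does not use the diffusers modular API; `TextSegments`, `encode_text`, and `generate` are hypothetical names, and the tokenizer is a toy stand-in. The point is only that the text step depends on the prompt alone, so its output can be kept and fed to many generations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextSegments:
    """Hypothetical container for the output of a text encoder block."""

    token_ids: tuple


def encode_text(prompt: str) -> TextSegments:
    """Toy tokenizer stand-in: depends only on the prompt, not on resolution or seed."""
    return TextSegments(token_ids=tuple(len(word) for word in prompt.split()))


def generate(segments: TextSegments, *, height: int, width: int, seed: int) -> dict:
    """Toy downstream step that consumes precomputed text segments."""
    return {"tokens": segments.token_ids, "size": (height, width), "seed": seed}


# Encode once ...
segments = encode_text("a robot arm picks up a cup")

# ... then reuse the same segments with different resolutions and seeds.
settings = [(720, 1280, 0), (480, 832, 1)]
runs = [generate(segments, height=h, width=w, seed=s) for (h, w, s) in settings]
```

If the text step also validated resolution or conditioning inputs, its output could not be reused this way, which is presumably why the comment asks to keep it narrow.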
| InputParam(name="image", default=None), | ||
| InputParam(name="video", default=None), | ||
| InputParam(name="condition_frame_indexes_vision", default=(0, 1)), | ||
| InputParam(name="condition_video_keep", default="first"), |
| InputParam(name="image", default=None), | |
| InputParam(name="video", default=None), | |
| InputParam(name="condition_frame_indexes_vision", default=(0, 1)), | |
| InputParam(name="condition_video_keep", default="first"), |
| InputParam(name="guidance_scale", type_hint=float, default=6.0), | ||
| InputParam(name="enable_sound", type_hint=bool, default=False), | ||
| InputParam(name="action", type_hint=CosmosActionCondition, default=None), |
| InputParam(name="guidance_scale", type_hint=float, default=6.0), | |
| InputParam(name="enable_sound", type_hint=bool, default=False), | |
| InputParam(name="action", type_hint=CosmosActionCondition, default=None), |
| if isinstance(block_state.callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)): | ||
| block_state.callback_on_step_end_tensor_inputs = block_state.callback_on_step_end.tensor_inputs |
| if isinstance(block_state.callback_on_step_end, (PipelineCallback, MultiPipelineCallbacks)): | |
| block_state.callback_on_step_end_tensor_inputs = block_state.callback_on_step_end.tensor_inputs |
we do not need to support pipeline callbacks in modular, since it is so easy to insert/swap blocks
| if block_state.width is None: | ||
| block_state.width = 1280 | ||
| components.check_inputs( |
we only need to check the inputs used in this block (I think you cannot directly reuse the check_inputs method from the standard pipeline)
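A block-local check could look roughly like the sketch below. This is an assumption about shape, not code from the PR: `check_text_inputs` is a hypothetical helper, and the specific rules are illustrative. What matters is that it looks only at the inputs the text block consumes and ignores image, video, and resolution arguments.

```python
def check_text_inputs(prompt) -> None:
    """Validate only what a text block reads (hypothetical helper, illustrative rules)."""
    if prompt is None:
        raise ValueError("`prompt` is required.")
    if not isinstance(prompt, (str, list)):
        raise ValueError(f"`prompt` must be a str or a list of str, got {type(prompt).__name__}.")
    if isinstance(prompt, list) and not all(isinstance(p, str) for p in prompt):
        raise ValueError("Every entry of `prompt` must be a str.")


# Valid inputs pass silently.
check_text_inputs("a robot arm picks up a cup")
check_text_inputs(["first prompt", "second prompt"])
```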
| condition_frame_indexes_vision=block_state.condition_frame_indexes_vision, | ||
| ) | ||
| block_state.action_mode = block_state.action.mode if block_state.action is not None else None |
can we give action its own text block? a Cosmos3ActionTextStep that takes prompt + action and then builds the action JSON prompt + resolution binning + tokenize ...
and then you can wrap this step (Cosmos3TextEncoderStep) and Cosmos3ActionTextStep into an AutoPipelineBlocks (e.g. Cosmos3AutoTextEncoderStep) triggered on action. this way each mode's text logic stays self-contained and more readable
see an example here: https://github.com/huggingface/diffusers/blob/main/src/diffusers/modular_pipelines/qwenimage/modular_blocks_qwenimage_edit.py#L200
this is an auto vae encoder step, but it should work similarly for text step as well
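The dispatch idea can be sketched in plain Python as follows. This is a toy illustration, not the diffusers `AutoPipelineBlocks` class: `AutoStep`, `plain_text_step`, and `action_text_step` are hypothetical names, and the real class may differ in details. Each candidate block is paired with a trigger input, and a trigger of `None` marks the default.

```python
def plain_text_step(state: dict) -> dict:
    """Toy stand-in for the regular text step."""
    return {**state, "text_mode": "plain"}


def action_text_step(state: dict) -> dict:
    """Toy stand-in for an action-specific text step."""
    return {**state, "text_mode": "action"}


class AutoStep:
    """Toy dispatcher: runs the first block whose trigger input is present."""

    def __init__(self, blocks, triggers):
        if len(blocks) != len(triggers):
            raise ValueError("`blocks` and `triggers` must have the same length.")
        self.blocks = blocks
        self.triggers = triggers

    def select(self, state: dict):
        for block, trigger in zip(self.blocks, self.triggers):
            # A trigger of None means "default": always selectable.
            if trigger is None or state.get(trigger) is not None:
                return block
        return None

    def __call__(self, state: dict) -> dict:
        block = self.select(state)
        return state if block is None else block(state)


auto_text_step = AutoStep(
    blocks=[action_text_step, plain_text_step],
    triggers=["action", None],
)

with_action = auto_text_step({"prompt": "move left", "action": {"mode": "demo"}})
without_action = auto_text_step({"prompt": "a cat"})
```

Keeping the action logic in its own block means neither step needs an `if action is not None` branch internally.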
| logger = logging.get_logger(__name__) | ||
Can you separate the VAE encoding from prepare_latent and add a proper Cosmos3VaeEncoderStep here?
We probably need a Cosmos3VaeEncoderStep (for i2v and v2v) and a Cosmos3ActionVaeEncoderStep, and pack them into an auto-step triggered on image/video/action.
similar to the text step, the VAE encoder step should also be able to run standalone when needed: a user should be able to run just the VAE encoder once, keep the latents, and reuse them across generations.
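A rough plain-Python sketch of that separation, under stated assumptions: `encode_condition` and `prepare_latents` are hypothetical functions, the "VAE" is a toy normalization, and writing the condition latents at the front of the noise is only a placeholder for whatever the real conditioning does. The sketch shows that the encoding is seed-independent, so it can be computed once and reused.

```python
import random


def encode_condition(pixels):
    """Toy stand-in for VAE encoding: deterministic and independent of the seed."""
    return [round(p / 255.0, 4) for p in pixels]


def prepare_latents(condition_latents, num_latents, seed):
    """Toy latent preparation: seeded noise with the condition latents written at the front."""
    rng = random.Random(seed)
    latents = [rng.gauss(0.0, 1.0) for _ in range(num_latents)]
    latents[: len(condition_latents)] = condition_latents
    return latents


# Encode the conditioning input once ...
condition_latents = encode_condition([0, 128, 255])

# ... and reuse it across generations that differ only in the seed.
first = prepare_latents(condition_latents, num_latents=6, seed=0)
second = prepare_latents(condition_latents, num_latents=6, seed=1)
```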
| logger = logging.get_logger(__name__) | ||
| class Cosmos3DecodeStep(ModularPipelineBlocks): |
I think we should split by modality as well, so Cosmos3VideoDecoderStep and Cosmos3SoundDecoderStep (the sound one can go into an auto block so it only runs if sound_latents is not None, like https://github.com/huggingface/diffusers/blob/main/src/diffusers/modular_pipelines/z_image/modular_blocks_z_image.py#L231)
Similar to the encoder steps, the user should also be able to run a decoder step standalone, so each block should just decode latents + run the safety checker, nothing else
The action-related code in the current block isn't decoding - I think it can probably go into its own block
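As a toy illustration of the per-modality split (plain Python, not the diffusers API; `decode_video`, `decode_sound`, and `decode_all` are hypothetical names and the arithmetic is a placeholder for real decoding), the sound decoder runs only when `sound_latents` is present:

```python
def decode_video(latents):
    """Toy video decoder: only turns latents into frames."""
    return [x * 2.0 for x in latents]


def decode_sound(sound_latents):
    """Toy sound decoder: only turns sound latents into samples."""
    return [x + 1.0 for x in sound_latents]


def decode_all(state: dict) -> dict:
    """Run the video decoder always, and the sound decoder only if sound latents exist."""
    outputs = {"video": decode_video(state["latents"])}
    if state.get("sound_latents") is not None:
        outputs["sound"] = decode_sound(state["sound_latents"])
    return outputs


video_only = decode_all({"latents": [0.5, 1.0]})
video_and_sound = decode_all({"latents": [0.5, 1.0], "sound_latents": [0.0, 2.0]})
```

Because each decoder takes only its own latents, either one can also be called directly on saved latents without running the rest of the pipeline.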
What does this PR do?
Summary
Cosmos3OmniModularPipeline and Cosmos3OmniBlocks.
docs/source/en/api/pipelines/cosmos3.md.
Test Plan
PYTHONPATH=src python -m pytest -q tests/pipelines/cosmos/test_cosmos3_modular_parity.py -vv
Before submitting
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.